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Abstract 



We derive generalization error bounds for stationary univariate autoregressive (AR) models. 
We show that imposing stationarity is enough to control the Gaussian complexity without 
further regularization. This lets us use structural risk minimization for model selection. We 
demonstrate our methods by predicting interest rate movements. 

1 Introduction 

In standard machine learning situations, we observe one variable, X, and wish to predict another 
variable, Y, with an unknown joint distribution. Time series models are slightly different: we observe 
a sequence of observations X" = {Xt}"^]^ from some process, and we wish to predict X„+^, for some 
h E N. Throughout what follows, X = {Xt}f^_ao "^ill be a sequence of random variables, i.e., each 
Xt is a measurable mapping from some probability space (f2, J-", P) into a measurable space X. A 
block of the random sequence will be written X^ = {ATtj^^^, where either limit may go to infinity. 

The goal in building a predictive model is to learn a function / which maps the past into 
predictions for the future, evaluating the resulting forecasts through a loss function £{Xn+h, /(X")) 
which gives the cost of errors. Ideally, we would use /* , the function which minimizes the risk 



over all f E J-, the class of prediction functions we can use. 

Since the true joint distribution of the sequence is unknown, so is R{f), but it is often estimated 
with the error on a training sample of size n 



R{f) = E[£{X,,+,,J{X^))], 




(1) 



t=i 



with / being the minimizer of i?„ over T. This is "empirical risk minimization" . 

While Rn{f) converges to R{f) for many algorithms, one can show that when / minimizes 
(fl]), E[i?„(/)] < R{f). This is because the choice of / adapts to the training data, causing the 
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training error to be an over-optimistic estimate of the true risk. Also, training error must shrink as 
model complexity grows. Thus, empirical risk minimization gives unsatisfying results: it will tend 
to overfit the data and give poor out-of-sample predictions. Statistics and machine learning propose 
two mitigation strategies. The first is to restrict the class J- . The second, which we follow, is to 
change the optimization problem, penalizing model complexity. Without the true distribution, the 
prediction risk or generalization error are inaccessible. Instead, the goal is finding bounds on the 
risk which hold with high probability — "probably approximately correct" (PAC) bounds. A typical 
result is a confidence bound on the risk which says that with probability at least 1 — ?7, 

<i?n(7)+'5(C(J-),n,r;), 

where C(-) measures the complexity of the model class J- , and is a function of this complexity, 
the confidence level, and the number of observed data points. 

The statistics and machine learning literature contains many generalization error bounds for both 
classification and regression problems with IID data, but their extension to time series prediction 
is a fairly recent development; in 1997, Vidyasagar [2J named extending such results to time series 
as an important open problem. Yu |22j sets forth many of the uniform ergodic theorems that are 
needed to derive generalization error bounds for stochastic processes. Meir [TT] is one of the first 
papers to construct risk bounds for time series. His approach was to consider a stationary but 
infinite-memory process, and to decompose the training error of a predictor with finite memory, 
chosen through empirical risk minimization, into three parts: 

i?(/p,«,d) = (i?(7p.n.d) - + (i?(/;,„) - + m;) 

where fp,n,d is an empirical estimate based on finite data of length n, finite memory of length p, and 
complexity indexed by d; f* ^ is the oracle with finite memory and given complexity, and /* is the 
oracle with finite memory over all possible complexities. The three terms amount to an estimation 
error incurred from the use of limited and noisy data, an approximation error due to selecting a 
predictor from a class of limited complexity, and a loss from approximating an infinite memory 
process with a finite memory process. 

More recently, others have provided PAC results for non-IID data. Steinwart and Christmann 
[20j prove an oracle inequality for generic regularized empirical risk minimization algorithms learn- 
ing from a-mixing processes, a fairly general sort of weak serial dependence, getting learning rates 
for least-squares support vector machines (SVMs) close to the optimal IID rates. Mohri and Ros- 
tamizadeh |13| prove stability-based generalization bounds when the data are stationary and Lp- 
mixing or /3-mixing, strictly generalizing IID results and applying to all stable learning algorithms. 
(We define /?- mixing below.) Karandikar and Vidyasagar [8j show that if an algorithm is "sub- 
additive" and yields a predictor whose risk can be upper bounded when the data are IID, then 
the same algorithm yields predictors whose risk can be bounded if data are /3-mixing. They use 
this result to derive generalization error bounds in terms of the learning rates for IID data and the 
/3-mixing coefficients. 

All these generalization bounds for dependent data rely on notions of complexity which, while 
common in machine learning, are hard to apply to models and algorithms ubiquitous in the time 
series literature. SVMs, neural networks, and kernel methods have known complexities, so their 
risk can be bounded on dependent data as well. On the other hand, autoregressive moving average 
(ARMA) models, generalized autoregressive conditional heteroskedasticity (GARCH) models, and 
state-space models in general have unknown complexity and are therefore neglected theoretically. 
(This does not keep them from being used in applied statistics, or even in machine learning and 
robotics, e.g., [ITj [TH [HI [21 [10].) Arbitrarily regularizing such models will not do, as often the only 
assumption applied researchers are willing to make is that the time series is stationary. 

We show that the assumption of stationarity regularizes autoregressive (AR) models implicitly, 
allowing for the application of risk bounds without the need for additional penalties. This result 
follows from work in the optimal control and systems design literatures but the application is novel. 
In S|2l we introduce concepts from time series and complexity theory necessary for our results. Section 
[3] uses these results to calculate explicit risk bounds for autoregressive models. Section |4] illustrates 
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the applicability of our methods by forecasting interest rate movements. We discuss our results and 
articulate directions for future research in f|5j 



2 Preliminaries 

Before developing our results, we need to explain the idea of effective sample size for dependent 
data, and the closely related measure of serial dependence called /3-mixing, as well as the Gaussian 
complexity technique for measuring model complexity. 

2.1 Time series 

Because time-series data are dependent, the number of data points n in a sample X" exaggerates 
how much information the sample contains. Knowing the past allows forecasters to predict future 
data (at least to some degree), so actually observing those future data points gives less information 
about the underlying data generating process than in the IID case. Thus, the sample size term 
in a probabilistic risk bound must be adjusted to reflect the dependence in the data source. This 
effective sample size may be much less than n. 

We investigate only stationary /3-mixing input data. We first remind the reader of the notion of 
(strict or strong) stationarity. 

Definition 2.1 (Stationarity). A sequence of random variables X is stationary when all its finite- 
dimensional distributions are invariant over time: for all t and all non-negative integers i and j , 
the random vectors X^^' and X'Jl*^-' have the same distribution. 

From among all the stationary processes, we restrict ourselves to ones where widely-separated 
observations are asymptotically independent. Stationarity does not imply that the random variables 
Xt are independent across time t, only that the distribution of Xt is constant in time. The next 
definition describes the nature of the serial dependence which we are willing to allow. 

Definition 2.2 (/^-Mixing) . Let af = o'(X^) be the a-field of events generated by the appropriate 
collection of random variables. Let Pt be the restriction o/P to 0*^00 j '^t+m be the restriction ofV to 
CT^^, andfti^t+m be the restriction o/P to tT(X^^,X^^). The coefficient of absolute regularity, 
or /3-mixing coefficient, /3(m), is given by 



where \ \ ■ \ \tv is the total variation norm. A stochastic process is absolutely regular, or /3-mixing, if 
j3{m) — >■ as m — > cx). 

This is only one of many equivalent characterizations of /3-mixing (see Bradley [5] for others). 
This definition makes clear that a process is /3-mixing if the joint probability of events which are 
widely separated in time increasingly approaches the product of the individual probabilities, i.e., 
that X is asymptotically independent. Typically, a supremum over t is taken in ([2]), however, this 
is unnecessary for stationary processes, i.e. /3(m) as defined above is independent of t. 

2.2 Gaussian complexity 

Statistical learning theory provides several ways of measuring the complexity of a class of predictive 
models. The results we are using here rely on Gaussian complexity (see, e.g., Bartlett and Mendelson 
[1]), which can be thought of as measuring how well the model can (seem to) fit white noise. 

Definition 2.3 (Gaussian Complexity). Let X" be a (not necessarily IID) sample drawn according 
to V. The empirical Gaussian complexity is 



/3{m) = ||P( X P, 



^mt~\-m\\TV ^ 



(2) 
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where Zi are a sequence of random variables, independent of each other and everything else, and 
drawn from a standard Gaussian distribution. The Gaussian complexity is 



©„(^) = E„ [©„(J-) 

where the expectation is over sample paths generated by v . 

The term inside the supremum, |^ X^iLi -^i/l-^i)!' sample covariance between the noise 

Z and the predictions of a particular model /. The Gaussian complexity takes the largest value of 
this sample covariance over all models in the class (mimicking empirical risk minimization), then 
averages over realizations of the noise. 

Intuitively, Gaussian complexity measures how well our models could seem to fit outcomes which 
were really just noise, giving a baseline against which to assess the risk of over-fitting or failing to 
generalize. As the sample size n grows, for any given / the sample covariance | ^ X]"=i -^j/C^i) | ~^ 0, 
by the ergodic theorem; the overall Gaussian complexity should also shrink, though more slowly, 
unless the model class is so flexible that it can fit absolutely anything, in which case one can conclude 
nothing about how well it will predict in the future from the fact that it performed well in the past. 



2.3 Error bounds for /3-mixing data 

Mohri and Rostamizadeh [T^] present Gaussiar{^ complexity-based error bounds for stationary /3- 
mixing sequences, a generalization of similar bounds presented earlier for the IID case. The results 
are data-dependent and measure the complexity of a class of hypotheses based on the training 
sample. 

Theorem 2.4. Let J- be a space of candidate predictors and let % be the space of induced losses: 

n = {h = t{;f{-)): f eF} 

for some loss function < •) < M . Then for any sample X" drawn from a stationary ^-mixing 
distribution, and for any /i, m > with 2/irn = n and rj > 4(/i — l)/3(m) where /3(m) is the mixing 
coefficient, with probability at least 1 — rj, 



1 /2 

R{f)<Rn{f) + [\) &^{n) + 3M 




and 



R{f) < Rnif) + (1)'^' &,{n) + M^ 



In 2/77' 



2fi ' 

where rj' = rj — A{fi — l)f3{m) in the first case or 77' = r/ — 2(/i — l)/3(m) in the second. 



The generalization error bounds in Theorem |2 . 4| have a straightforward interpretation. The risk of 
a chosen model is controlled, with high probability, by three terms. The first term, the training error, 
describes how well the model performs in-sample. More complicated models can more closely fit any 
data set, so increased complexity leads to smaller training error. This is penalized by the second 
term, the Gaussian complexity. The first bound uses the empirical Gaussian complexity which is 
calculated from the data X" while the second uses the expected Gaussian complexity, and is therefore 
tighter. The third term is the confidence term and is a function only of the confidence level 77 and the 
effective number of data points on which the model was based ji. While it was actually trained on n 
data points, because of dependence, this number must be reduced. This process is accomplished by 
taking ji widely spaced blocks of points. Under the asymptotic independence quantified by (3, this 
spacing lets us treat these blocks as independent. 



^In fact, they present the bounds in terms of the Rademacher complexity, a closely related idea. However, using 
Gaussian complexity instead requi res n o modifications to their results while simplifying the proofs contained here. 



The constant {it/2) ' in Theorem 2.4 is given in Ledoux and Talagrand (9j 
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3 Results 



Autoregressive models are used frequently in economics, finance, and other disciplines. Their main 
utility lies in their straightforward parametric form, as well as their interpretability: predictions for 
the future are linear combinations of some fixed length of previous observations. See Shumway and 
Stoffer [TU] for a standard introduction. 

Suppose that X is a real-valued random sequence, evolving as 



where et has mean zero, finite variance, ej _LL for all i =/= j, and _LL Xj for all i > j. This 
is the traditional specification of an autoregressive order p or AR(p) model. Having observed data 
{Xt}f^i, and supposing p to be known, fitting the model amounts to estimating the coefficients 
{4>}i=i- The most natural way to do this is to use ordinary least squares (OLS). Let 
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Generalization error bounds for these processes follow from an ability to characterize their Gaus- 
sian complexity. The theorem below uses stationarity to bound the risk of AR models. The remainder 
of this section provides the components necessary to prove the results. 

Theorem 3.1. Let _D„ he a sample of length n from a stationary (^-mixing distribution. For any 
/i,m > with 2fim — n and rj > 4(/^ — 1)/3(to), then under squared error loss truncated at M, the 
prediction error of an AR{p) (p > 1) model can be bounded with probability at least 1 — 77 using 



Rif) < RM) 



V7rMlog(p+l) (\^/^ A. A \A 

jJL l<jj'<p+l ^ ^ ^ ' J 



1/2 



3M. 



'In 4/77' 
2/i '■ 



- 4V7rMlog(n+ 1) 
R{f) < Rn{f) + — — 



1/2- 



/in 2/77' 
2/1 ' 



ana X,- IS 



where I — {i : i ~ [a,/2\ + 2ak, < k < fj,}, <pj is the j^^ vertex of the stability domain, 
the i*^ row of the design matrix X. 

For p = 1 slight adjustments are required. We state this result as a corollary. 

Corollary 3.2. Under the same conditions as above, the prediction error of an AR{1) model can be 
bounded with probability at least 1 — 77 using 



Rif)<Rnif) + -j^{Y.^' 



1/2 



M V 2 



3Mi 



/in 4/77' 



2fi ' 



- - 4 /M 
i?(/)<i?n(/) + -^yE 



1/2- 



/in 2/77' 



2/7 
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3.1 Proof components 

To prove Theorem |3.1| it is necessary to control the size of the model class by using the stationarity 
assumption. 

3.1.1 Stationarity controls the hypothesis space 

Define, as an estimator of 4>, 

$ = argmin||Y-X0||2, (3) 


where || • ||2 is the Euclidean normj^ Equation [s] has the usual closed form OLS solution: 

= (X'X)"^X'Y. (4) 

Despite the simplicity of Eq. [4] modellers often require that the estimated autoregressive process 
be stationary. This can be checked algebraically: the complex roots of the polynomial 

Qp(z) - zP - - <i>2zP-^ 0p 

must lie strictly inside the unit circle. Eq. [3] is thus not quite right for estimating a stationary 
autoregressive model, as it does not incorporate this constraint. 

Constraining the roots of Qp{z) constrains the coefficients 0. The set 4> where the process is 
stationary is the stability domain. Bp. Clearly, Bi is just |0i| < 1. Fam and Meditch gives a 
recursive method for determining Bp for general p. In particular, they show that the convex hull 
of the space of stationary solutions is a convex polyhedron with vertices at the extremes of the Bp. 
This convex hull basically determines the complexity of stationary AR models. 



3.1.2 Gaussian complexity of AR models 

Returning to the AR(p) model, it is necessary to find the Gaussian complexity of the function class 



Fp = 1(f) : Xt = 'Pi^t-i and Xt is stationary > 



Theorem 3.3. For the AR(p) model with p > 1, the empirical Gaussian complexity is given by 

(k \ 
^(X„0^.-c/.^.,)M , 
/ 

where (f)^ is the j*'* vertex of the stability domain and X^ is the i^^ row of the design matrix X. 

The proof relies on the following version of Slepian's Lemma (see, for example Ledoux and 
Talagrand [9] or Bartlett and Mendelson [1]). 

Lemma 3.4 (Slepian). Let Vi,...,Vk be random variables such that for all 1 < j < k, Vj = 
EiLi o-ijfji where gi, . . . , (7„ are iid standard normal random variables. Then, 



E [ max Vj\ < \/2(log k) max J E [(V,- - Vf^ 
3 id' V 



Proof of Theorem 3.3 



^ 2 2 " 

6„J" = E sup -y^Qifixi) = E sup -S^ gi{Xi, cf)) 

/ 2 " \ / 2 " 

= E sup ( — / \ = E sup ( — 2. Qi'^ij 4> 



There are other ways to estimate AR models, but they typically amount to very similar optimization problems. 
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Figure 1: Growth rate of 10-year treasury bond 



where the last equahty fohows from Theorem 12 in T. By standard resuhs from convex optimization, 
this supremum is attained at one of the vertices of conv{Bp). Therefore, 

2 " 

©„( J-) = E [ max - V 5, (X„ )] , 

i—l 

where 4>j is the j*'* vertex of conv{Bp). Let Vj — J^i^^i 'Pj)- Then by the Lemma 



3.4 



< — (logp + 1)1/2 l^^y^ _ 



2a/2 



2V2 



(logp + 1)1/2 j^^^ 



E 



(l0gp+ 1)1/2 



max 



□ 



where is the i"'-entry of the design matrix. 

When p = I, as in Corohary |3.2[ we can calculate the complexity directly. The proof's last line 
shows that we are essentially interested in the diameter of the stability domain Bp projected onto 
the column space of X, which gives a tighter bound than that from the general results on linear 
prediction in e.g. Kakade et al. [7]. 

Since we care about the complexity of the model class viewed through the loss function £, 
we must also account for this additional complexity. For c-Lipschitz loss functions, this just means 
multiplying 0„(J^) by 2c. 



4 Application 

We illustrate our results by predicting interest rate changes — specifically, the 10-year Treasury 
Constant Maturity Rate series from the Federal Reserve Bank of St. Louis' FRED databas^ — 
recorded daily from January 2, 1962 to August 31, 2010. Transforming the series into daily natural- 
log growth rates leaves n — 12150 observations (Figure [T]). The changing variance apparent in the 
figure is why interest rates are typically forecast with GARCH(1,1) models. For this illustration 
however, we will use an AR(p) model, picking the memory order p by the risk bound. 
Figure [2] shows the training error 

^ ^ 1 " ^ 
n — n ^-^ 



^Available at http: //research, stlouisfed. org/fred2/series/DGS10?cid=l 15 
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Figure 3: Generalization error bound for different model orders 



where Xt is the t*^ datapoint, and Xt is the model's prediction. _R„ shrinks as the order of the 
model (p) grows, as it must since ordinary least squares minimizes i?„ for a given p. Also shown is 
the gap between the AIC for different p and the lowest attainable value; this would select an AR(36) 
model. 

A better strategy uses the probabilistic risk bound derived above. The goal of model selection 
is to pick, with high probability, the model with the smallest risk; this is Vapnik's structural risk 
minimization principle. Here, it is clear that AIC is dramatically overfitting. The optimal model 
using the risk bound is an AR(1). Figure [s] plots the risk bound against p with the loss function 
truncated at 0.05. (No daily interest rate change has ever had loss larger than 0.034, and results are 
fairly insensitive to the level of the loss cap.) This bound says that with 95% probability, regardless 
of the true data generating process, the AR(1) model will make mistakes with squared error no larger 
than 0.0079. If we had instead predicted with zero, this loss would have occurred three times. 



One issue with Theorem 2.4 is that it requires knowledge of the /3-mixing coefficients, j3{m). 
Of course, the dependence structure of this data is unknown, so we calculated it under generous 
assumptions on the data generating process. In a homogeneous Markov process, the /3-mixing 
coefficients work out to 

/3(to) = J 7r{dx)\\P'^{x,-)-7T\\Tv 

where P"^{x, •) is the m-step transition operator and tt is the stationary distribution [141 fi]. Since 
AR models are Markovian, we estimated an AR(g) model with Gaussian errors for q large and 
calculated the mixing coefficients using the stationary and transition distributions. To create the 
bound, we used m = 7 and p, = 867. We address non-parametric estimation of /3-mixing coefficients 
elsewhere [Anon.]. 



5 Discussion 

We have constructed a finite-sample predictive risk bound for autoregressive models, using the 
stationarity assumption to constrain OLS estimation. Interestingly, stationarity — a common as- 
sumption among applied researchers — constrains the model space enough to yield bounds without 
further regularization. Moreover, this is the first predictive risk bound we know of for any of the 
standard models of time series analysis. 
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Traditionally, time series analysts have selected models by blending empirical risk minimization, 
more-or-less quantitative inspection of the residuals (e.g., the Box-Ljung test; see [H]), and AIC. 
In many applications, however, what really matters is prediction, and none of these techniques, 
including AIC, controls generalization error, especially with mis-specification. (Cross-validation is a 
partial exception, but it is tricky for time series; see |16j and references therein.) Our bound controls 
prediction risk directly. Admittedly, our bound covers only univariate autoregressive models, the 
plainest of a large family of traditional time series models, but we believe a similar result will cover 
the more elaborate members of the family such as vector autoregressive, autoregressive-moving 
average, or autoregressive conditionally heteroskedastic models. While the characterization of the 
stationary domain from [5] on which we relied breaks down for such models, they are all variants of 
the linear state space model [5], whose parameters are restricted under stationarity, and so we hope 
to obtain a general risk bound, possibly with stronger variants for particular specifications. 
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